Bayesian Linear Regression

Analysis of Flight Delay Data

Sara Parrish, Heather Anderson (Advisor: Dr. Seals)

Nov 17, 2024

Introduction

Objectives

  • Introduce Bayesian Linear Regression (BLR): Understand its principles and how it differs from traditional methods.

  • Explain Bayesian Concepts: Highlight Bayes’ Theorem, prior knowledge, and posterior distributions.

  • Discuss Practical Applications: Show how BLR is applied in analyzing real-world data, like airline delays.

  • Explore Advantages of Bayesian Methods: Quantifying uncertainty, improving predictions, and handling complex data.

  • Present Analysis Findings: Summarize key insights from our BLR model on weather-related airline delays.

What is Bayesian Linear Regression?

  • BLR: A statistical approach combining prior knowledge and new data.

  • Goal: Model relationships, make predictions, and handle uncertainty in estimates.

  • Difference from Traditional Methods: Parameters are described by probability distributions rather than single fixed values.

Introduction to Bayesian Linear Regression

  • Regression under the frequentist framework
    • Independent variables are used to predict dependent variables
    • Linear regression finds the best-fitting line through the observed data to make further predictions
      • Regression parameters (\beta) are assumed to be fixed but unknown
    • Only the collected data are used for estimation
  • Regression under the Bayesian framework
    • Independent variables are used to predict dependent variables
    • Regression parameters (\beta) are treated as random variables with probability distributions, not as fixed values
    • The collected data are combined with prior knowledge for estimation
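
The contrast between the two frameworks can be written compactly with Bayes' theorem. As a sketch, with y for the observed responses and X for the predictors (notation assumed here, not taken from the slides):

```latex
p(\beta, \sigma \mid y, X)
  = \frac{p(y \mid X, \beta, \sigma)\, p(\beta, \sigma)}{p(y \mid X)}
  \propto p(y \mid X, \beta, \sigma)\, p(\beta, \sigma)
```

A frequentist fit works with the likelihood p(y \mid X, \beta, \sigma) alone. A Bayesian fit weights that likelihood by the prior p(\beta, \sigma) and reports the resulting posterior distribution.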

Why Bayesian?

Advantages of Bayesian Linear Regression[1]

  • Incorporation of Prior Knowledge

  • Uncertainty Quantification

  • Expanded Hypotheses

  • Automatic Meta-Analyses

  • Improved Handling of Small Samples

  • Complex Model Estimation

Steps in Bayesian Linear Regression

  1. Model Specification: Define the linear relationship between the dependent and independent variables.

  2. Choose Priors: Select prior distributions for the model parameters, reflecting any existing knowledge about their values.

  3. Data Collection: Gather relevant data for the variables in the model.

  4. Model Fitting: Use computational methods, such as Markov Chain Monte Carlo (MCMC), to estimate the posterior distributions of the parameters based on the observed data.

  5. Result Interpretation: Analyze the posterior distributions to understand the relationships between variables, including estimating means and credible intervals.
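
The five steps can be sketched end to end in a few lines. This is a minimal illustration, not the analysis reported later: the data are synthetic, the priors (N(0, 10^2) on the coefficients, Exp(1) on sigma) are assumptions chosen for the example, and a basic random-walk Metropolis sampler stands in for whatever MCMC engine a full analysis would use.

```python
import math
import random

random.seed(42)

# Steps 1 and 3: a linear model and some data. The data here are synthetic,
# generated from known values so the fit can be checked.
TRUE_B0, TRUE_B1, TRUE_SIGMA = 1.0, 2.0, 0.5
xs = [(i - 24.5) / 10 for i in range(50)]  # centered predictor
ys = [TRUE_B0 + TRUE_B1 * x + random.gauss(0, TRUE_SIGMA) for x in xs]


def log_posterior(b0, b1, log_sigma):
    """Unnormalized log posterior, with sigma sampled on the log scale."""
    sigma = math.exp(log_sigma)
    # Step 2: priors (assumed for this sketch): b0, b1 ~ N(0, 10^2), sigma ~ Exp(1).
    log_prior = -0.5 * (b0 / 10) ** 2 - 0.5 * (b1 / 10) ** 2 - sigma
    log_prior += log_sigma  # Jacobian term for the log transform of sigma
    # Gaussian likelihood: y_i ~ N(b0 + b1 * x_i, sigma^2)
    log_lik = sum(
        -log_sigma - 0.5 * ((y - b0 - b1 * x) / sigma) ** 2
        for x, y in zip(xs, ys)
    )
    return log_prior + log_lik


def metropolis(n_iter, step=0.08):
    """Step 4: random-walk Metropolis over (b0, b1, log_sigma)."""
    state = [0.0, 0.0, 0.0]
    current = log_posterior(*state)
    draws = []
    for _ in range(n_iter):
        proposal = [value + random.gauss(0, step) for value in state]
        candidate = log_posterior(*proposal)
        # Accept with probability min(1, posterior ratio).
        if random.random() < math.exp(min(0.0, candidate - current)):
            state, current = proposal, candidate
        draws.append(tuple(state))
    return draws


draws = metropolis(20000)[5000:]  # discard the first 5000 draws as warmup

# Step 5: summarize the slope with a posterior mean and 95% credible interval.
b1_draws = sorted(d[1] for d in draws)
b1_mean = sum(b1_draws) / len(b1_draws)
ci_low = b1_draws[int(0.025 * len(b1_draws))]
ci_high = b1_draws[int(0.975 * len(b1_draws))]
print(f"slope: mean={b1_mean:.2f}, 95% CrI=[{ci_low:.2f}, {ci_high:.2f}]")
```

Sampling log(sigma) rather than sigma keeps the proposal from stepping to negative values; the added `log_sigma` term is the change-of-variables correction that makes this valid.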

Methods

Heather’s Prior Selection & Model Specification

Prior Selection

  • Intercept (\beta_0): \beta_0 \sim N(0, 5^2). Assumes no strong baseline effect.

  • Slope (\beta_1): \beta_1 \sim N(0, 5^2). Reflects no strong prior belief about the relationship between weather incidents and delays.

  • Error Term (\sigma): \sigma \sim \text{Exp}(1). Accounts for variability in delays; allows flexibility.

Model Specification

Y_i \mid \beta_0, \beta_1, \sigma \sim N(\mu_i, \sigma^2)

\mu_i = \beta_0 + \beta_1 X_i

  • Y_i: Total arrival delay in minutes (arr_delay)
  • X_i: Number of weather-caused delays (weather_ct)
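
One way to see what these priors imply is a prior predictive simulation: draw parameters from the priors, then generate delays from the model before looking at any data. A minimal sketch follows; the function name and the example X values are hypothetical and not taken from the dataset.

```python
import random
import statistics

random.seed(0)


def prior_predictive_draw(x_values):
    """Draw one parameter set from the priors, then simulate a delay per x."""
    beta_0 = random.gauss(0, 5)      # beta_0 ~ N(0, 5^2)
    beta_1 = random.gauss(0, 5)      # beta_1 ~ N(0, 5^2)
    sigma = random.expovariate(1.0)  # sigma ~ Exp(1), i.e. rate 1
    return [random.gauss(beta_0 + beta_1 * x, sigma) for x in x_values]


example_counts = [0, 1, 5, 10]  # hypothetical X values, not from the data
sims = [prior_predictive_draw(example_counts) for _ in range(2000)]

# Spread of simulated delays at each example X value.
for j, x in enumerate(example_counts):
    column = [sim[j] for sim in sims]
    print(f"X = {x:>2}: prior predictive sd = {statistics.pstdev(column):.1f}")
```

Comparing the simulated spread with the scale of the observed delays is a quick check on whether the priors are as weak as intended.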

Analysis & Results

Meet My Dataset!

Variable              Description
year                  Year of the data.
month                 Month of the data.
carrier               Carrier code.
carrier_name          Carrier name.
airport               Airport code.
airport_name          Airport name.
arr_flights           Number of arriving flights.
arr_del15             Number of flights delayed by 15+ minutes.
carrier_ct            Number of carrier-caused delays.
weather_ct            Number of weather-caused delays.
nas_ct                Number of NAS-related delays.
security_ct           Number of security-caused delays.
late_aircraft_ct      Number of delays from late-arriving aircraft.
arr_cancelled         Number of canceled flights.
arr_diverted          Number of diverted flights.
arr_delay             Total arrival delay (minutes).
carrier_delay         Delay attributed to the carrier (minutes).
weather_delay         Delay attributed to weather (minutes).
nas_delay             Delay attributed to the NAS (minutes).
security_delay        Delay attributed to security (minutes).
late_aircraft_delay   Delay from late-arriving aircraft (minutes).

Exploring the Data

Choosing Focus

Table 1: Summary of Flight Arrivals, Delays, Cancellations, and Diversions (August Data)
Characteristic                        Value
Total Months of Data (August)         1
Total Carriers                        21
Total Arrived Flights (Count Data)    62,146,805
Total Delayed Flights (15+ min)       11,375,095
- Carrier Delays (31.34%)             3,565,080.59
- Weather Delays (3.39%)              385,767.94
- NAS Delays (29.21%)                 3,322,432.52
- Security Delays (0.24%)             26,930.39
- Late Aircraft Delays (35.82%)       4,074,891.00
Total Cancelled Flights               1,290,923
Total Diverted Flights                148,007
Cancelled Flights (%)                 2.08
Diverted Flights (%)                  0.24
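
The percentages in Table 1 can be reproduced from the counts. The sketch below assumes each delay-cause share is taken relative to total delayed flights, and that the cancelled and diverted percentages are relative to total arrived flights; variable names are illustrative.

```python
# Counts copied from Table 1.
arrived_total = 62_146_805.00
delayed_total = 11_375_095.00
cause_counts = {
    "carrier": 3_565_080.59,
    "weather": 385_767.94,
    "nas": 3_322_432.52,
    "security": 26_930.39,
    "late_aircraft": 4_074_891.00,
}

# Share of delayed flights attributed to each cause, in percent.
cause_share = {
    cause: round(100 * count / delayed_total, 2)
    for cause, count in cause_counts.items()
}

# Cancellations and diversions as a percent of arrived flights.
cancelled_pct = round(100 * 1_290_923.00 / arrived_total, 2)
diverted_pct = round(100 * 148_007.00 / arrived_total, 2)

for cause, share in cause_share.items():
    print(f"{cause:>14}: {share:.2f}%")
print(f"cancelled: {cancelled_pct:.2f}%, diverted: {diverted_pct:.2f}%")
```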

Code for Model

Trace Plots and Posterior Distributions

Model Parameters and Estimates

Parameter Estimate Standard Error 95% Credible Interval
Intercept -2116.53 7.67 [-2131.41, -2100.91]
Weather Count 1041.97 2.66 [1036.73, 1047.15]
Sigma 8676.19 15.52 [8646.95, 8706.92]
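
When a posterior is roughly symmetric and bell-shaped, a 95% credible interval lands close to the estimate plus or minus 1.96 posterior standard deviations. The sketch below checks the table against that approximation. It assumes the "Standard Error" column is the posterior standard deviation and that the reported intervals come from posterior quantiles, so small gaps are expected.

```python
# (estimate, standard error, reported 95% credible interval) from the table.
rows = {
    "Intercept": (-2116.53, 7.67, (-2131.41, -2100.91)),
    "Weather Count": (1041.97, 2.66, (1036.73, 1047.15)),
    "Sigma": (8676.19, 15.52, (8646.95, 8706.92)),
}

Z_95 = 1.96  # standard normal quantile for a central 95% interval

max_gap = 0.0
for name, (estimate, sd, (low, high)) in rows.items():
    approx_low = estimate - Z_95 * sd
    approx_high = estimate + Z_95 * sd
    gap = max(abs(approx_low - low), abs(approx_high - high))
    max_gap = max(max_gap, gap)
    print(f"{name}: normal approx [{approx_low:.2f}, {approx_high:.2f}], "
          f"reported [{low:.2f}, {high:.2f}]")
```

Close agreement between the two intervals suggests the posteriors are near normal, which is typical with a sample this large.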

Model Diagnostics and Fit Statistics

Statistic Value
Number of Observations 171,426
Model Family Gaussian
Formula arr_delay ~ weather_ct
Iterations 2000
Warmup 1000
Chains 4
Effective Sample Size (Bulk) [Intercept, Weather Count] [2102.722, 2000.139]
Effective Sample Size (Tail) [Intercept, Weather Count] [2095.692, 1858.849]
Posterior Mean of Weather Count Coefficient 1041.966
Posterior Median of Weather Count Coefficient 1041.971
Posterior Standard Deviation of Weather Count Coefficient 2.660956
95% Credible Interval for Weather Count Coefficient [1036.731, 1047.15]
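
The sampler settings and effective sample sizes above can be combined into two quick diagnostics. This sketch assumes "Iterations" counts draws per chain including warmup, and uses 2.660956 from the table as the posterior standard deviation of the weather count coefficient (it matches that coefficient's reported standard error of 2.66).

```python
import math

chains, iterations, warmup = 4, 2000, 1000

# Assumption: iterations are per chain and include warmup.
total_draws = chains * (iterations - warmup)

ess_bulk = {"Intercept": 2102.722, "Weather Count": 2000.139}

# Fraction of retained draws that are effectively independent.
ess_ratio = {name: ess / total_draws for name, ess in ess_bulk.items()}

# Monte Carlo standard error of the posterior mean: sd / sqrt(ESS).
posterior_sd = 2.660956
mcse = posterior_sd / math.sqrt(ess_bulk["Weather Count"])

print(f"post-warmup draws: {total_draws}")
for name, ratio in ess_ratio.items():
    print(f"{name}: ESS ratio = {ratio:.2f}")
print(f"MCSE of weather count mean = {mcse:.3f}")
```

A Monte Carlo standard error that is small relative to the posterior standard deviation indicates that sampling noise contributes little to the reported estimate.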

Posterior Distribution for Weather Count Coefficient

Posterior Predictive Check

Conclusion

Key Findings

  • Intercept: -2116.53 (95% CI: [-2131.41, -2100.91])

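    • This is the model's expected total arrival delay when weather_ct = 0. A negative total delay is not physically meaningful, so it more likely signals that a straight line fits poorly at low weather counts than that delays are shorter without weather incidents.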
  • Weather Count Coefficient: 1041.97 (95% CI: [1036.73, 1047.15])

    • Each additional weather-caused delay is associated with about 1,042 more minutes of total arrival delay, on average.

    • Weather incidents are infrequent but highly disruptive.

  • Uncertainty Measures:

    • Residual standard deviation (\sigma) = 8676.19 minutes.

    • Suggests that other, unmeasured factors affect delays.

  • Model Diagnostics:

    • Rhat = 1.00 for all parameters, indicating convergence.

    • Large effective sample sizes support reliable posterior estimates.
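
Plugging the reported posterior means into the fitted line shows what the model predicts at specific weather counts. This is a plug-in sketch that ignores posterior uncertainty; the function name and the example counts are illustrative.

```python
INTERCEPT = -2116.53  # posterior mean of beta_0
SLOPE = 1041.97       # posterior mean of beta_1 (weather count coefficient)


def expected_delay(weather_ct):
    """Expected total arrival delay (minutes) at a given weather count."""
    return INTERCEPT + SLOPE * weather_ct


# Weather count at which the expected delay crosses zero.
zero_crossing = -INTERCEPT / SLOPE

for count in (0, 1, 2, 3, 10):
    print(f"weather_ct = {count:>2}: expected delay = {expected_delay(count):.2f}")
print(f"expected delay is zero at weather_ct = {zero_crossing:.2f}")
```

Below the zero crossing the fitted line predicts negative total delay, which is worth keeping in mind when interpreting the intercept.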

Conclusion

  • Key Insight:

    • Weather-related incidents, though infrequent, have a disproportionately large impact on delay times.

    • Highlights the need for better weather management and forecasting.

  • Bayesian Approach:

    • Accounts for uncertainty, providing credible intervals for estimates.

    • Supports informed decision-making in airline operations and policy-making.

Discussion and Future Research

  • What other factors could be included in the model?

  • How could expanding the dataset improve insights?

  • What advanced Bayesian methods could be explored?

  • How should outliers be addressed?

  • What assumptions should be revisited?

Thank You! Questions?

References

[1] M. J. Zyphur and F. L. Oswald, “Bayesian estimation and inference,” J. Manage., vol. 41, no. 2, pp. 390–420, Feb. 2015.